GH-48744: [C++][Parquet] Fix undefined behavior in memcpy with nullptr#48745
GH-48744: [C++][Parquet] Fix undefined behavior in memcpy with nullptr#48745rynewang wants to merge 3 commits intoapache:mainfrom
Conversation
…nullptr When ByteArray has len=0 and ptr=nullptr (the default-constructed state), calling std::memcpy with nullptr is undefined behavior according to the C++ standard, even when size is 0. This commit adds guards to check len > 0 before calling memcpy in: - TypedStatisticsImpl<ByteArrayType>::Copy (statistics.cc) - DictDecoderImpl<ByteArrayType>::SetDict (decoder.cc) - Test utilities (statistics_test.cc, column_writer_test.cc) Claude-Generated-By: Claude Code (cli/claude-opus-4-5=100%) Claude-Steers: 1 Claude-Permission-Prompts: 8 Claude-Escapes: 1
|
|
|
BTW, are there other similar calls in the parquet folder? I just did a quick peek and it seems that arrow/cpp/src/parquet/column_writer.cc Line 919 in fe5e0e5 |
Co-authored-by: Gang Wu <ustcwg@gmail.com>
|
this is getting scattered all the places, so I think we can just provide a |
Introduce a centralized SafeMemcpy function in types.h that guards against calling std::memcpy with nullptr, which is undefined behavior even when size is 0. Replace memcpy calls throughout parquet with SafeMemcpy where the size is dynamic and could be 0 (ByteArray lengths, buffer sizes, RLE sizes). Keep std::memcpy for static sizes (sizeof, FLBA type_length) that are guaranteed > 0.
|
@wgtmac can you help merging this? Thanks! |
wgtmac
left a comment
There was a problem hiding this comment.
I'm not sure if it is a good idea to wrap memcpy though the performance might not affected if it is inlined. My concern is the inconsistently mixed use of the wrapper and the raw function across the code base which confuses developers on which to choose in the future.
Should we just consider using std::copy_n to replace memcpy in the parquet subdirectory where variable-length is copied to limit the scope of this change? In principle, std::copy_n is safer and can be optimized by compilers.
WDYT @pitrou?
| /// | ||
| /// According to the C++ standard, calling std::memcpy with a nullptr is undefined | ||
| /// behavior even when size is 0. This function guards against that case. | ||
| static inline void SafeMemcpy(void* dest, const void* src, size_t size) { |
There was a problem hiding this comment.
This might be the wrong place to put SafeMemcpy. It seems cpp/src/arrow/util/memory_internal.h is a better home.
There was a problem hiding this comment.
We already have arrow/util/ubsan.h and there's a handy function MakeNonNull there.
But I wouldn't be against adding a SafeMemcpy to that same file if it makes things easier.
Rationale for this change
As noted by @wgtmac in #48717 (comment), calling
std::memcpywith a nullptr source is undefined behavior according to the C++ standard, even when the size is 0.When
ByteArrayhaslen=0andptr=nullptr(the default-constructed state pertypes.h:649), the current code invokes UB.What changes are included in this PR?
Added guards to check
len > 0before callingmemcpyin:TypedStatisticsImpl<ByteArrayType>::Copyinstatistics.ccDictDecoderImpl<ByteArrayType>::SetDictindecoder.ccstatistics_test.ccandcolumn_writer_test.ccAre these changes tested?
Existing tests pass. The fix prevents potential UB that sanitizers or optimizing compilers could exploit.
Are there any user-facing changes?
No.